class: center, middle, inverse, title-slide

# Lecture 4
## Statistical Models and Notation
### Psych 10 C
### University of California, Irvine
### 04/06/2022

---

## Objective in research

- One of our main objectives in research is to contrast our beliefs about the world with the outcomes of experiments.

--

- We do so by starting with some "verbal" statement or belief about the world, which we then formalize using a statistical model.

--

- Statistical models will allow us to make predictions about future observations. In the case of an experiment, they will allow us to make predictions about the outcomes.

--

- The next step is to evaluate those predictions by comparing them with the outcomes (data) of the experiment.

--

- Finally, we would like to go back and interpret the results of our evaluations with respect to our original beliefs or statements about the world.

---

## Statistical models

- Statistical models are abstract representations of the world.

--

- They are a way in which we can formalize our beliefs about probabilistic events.

--

- For example, suppose we have an experiment where we throw a coin and have two competing ideas about the coin:

--

  - The coin is **fair**.

--

  - The coin is **not fair**.

--

- We can formalize these two beliefs using a statistical model.

--

  - The coin is fair: `\(P(\{heads\})\ =\ P(\{tails\})\ =\ 0.5\)`

--

  - The coin is not fair: `\(P(\{heads\})\ \neq\ P(\{tails\})\)`

--

- We moved from two verbal statements about our beliefs regarding the coin to two formal statements about the probability of "heads".

---

## Statistical Models

- Statistical models are the formal representation of our beliefs or hypotheses about the outcomes of an experiment.

--

- Given that we assume that the outcomes are probabilistic, our models will have a probabilistic component associated with them.

--

- Given the nature of our observations, it will be almost impossible for us to tell if a model is TRUE or FALSE.
However, we can compare how useful they are in a given situation.

--

- Statistical models will allow us to make predictions about our observations, which we will then use to compare how useful they are.

--

- However, before we continue, it will be useful to introduce some notation!

--

- This will provide us with a way to express our models in a formal and standard way.

---

class: inverse, middle, center

# Notation

---

## Example:

- To introduce notation we will start with a problem.

--

- **Problem:** We want to know if people who smoke have lower lung capacity in comparison with people who do not smoke.

--

- We have a variable that we are interested in, which is lung capacity as measured by some standard test.

--

- We also have a variable that indicates whether a given participant smokes or not.

--

- We call the first one a **dependent** variable, because we want to see how it "depends" on the values of another.

--

- We call the smoker indicator variable an **independent** variable. We are interested in how our independent variable affects the values of our dependent variable.

--

- In other words, we want to know if lung capacity is a function of smoking status.

---

## Example: Smoking

- We collect data from 8 participants, 4 smokers and 4 non-smokers.

--
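- As a hedged illustration, such a data set could be stored in R as a matrix with one column per group. The values below are invented for this sketch, except `\(y_{21} = 78.9\)` and `\(y_{32} = 66.3\)`, which match the worked examples in these slides:

```r
# Hypothetical lung-capacity scores (invented for illustration):
# row i = observation number, column j = group (1 = non-smokers, 2 = smokers)
y <- matrix(c(82.1, 78.9, 75.4, 80.2,   # group 1: non-smokers
              70.3, 68.7, 66.3, 71.8),  # group 2: smokers
            nrow = 4)
y[2, 1]  # y_21: the 2nd observation of group 1
y[3, 2]  # y_32: the 3rd observation of group 2
```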
--

- We will denote values of our dependent variable using `\(y\)`. For example, the first observation of our first group (non-smokers) is denoted as `\(y_{11}\)`, while the fourth observation of the same group is denoted `\(y_{41}\)`.

---

## Example: Smoking

- In general we say that the *i-th* observation of the *j-th* group is denoted as `\(y_{ij}\)`. Note that the letters `\(i\)` and `\(j\)` are a way to denote a general observation; if we want to look at a particular one, we can write:

--

  - `\(y_{21}=\)` 78.9
  - `\(y_{32}=\)` 66.3

--

- Now that we have a notation for our observations, we need a way to describe their variability.

--

- Remember that our objective is to formalize our beliefs or hypotheses about the world.

--

- We know that our observations are probabilistic, so we need a way to describe their variability.

--

- In order to do this we will use the normal distribution.

---

class: inverse, middle, center

# The Normal (Gaussian) Distribution

---

## The Normal distribution

- The normal distribution is one of the most widely used statistical models in the literature.

--

- One of its main advantages is that it can be described using two parameters, `\(\mu\)` and `\(\sigma^2\)`.

--

- We denote the Normal distribution as `\(\text{Normal}(\mu,\sigma^2)\)`.

---

## Standard Normal distribution

- `\(\text{Normal}(\mu = 0,\sigma^2 = 1)\)`

<img src="data:image/png;base64,#lec-4_files/figure-html/norm-examp-1.png" style="display: block; margin: auto;" />

---

## Normal distribution

- The first parameter of the normal distribution, `\(\mu\)`, represents the center of the distribution. Notice that this is the value that has the highest density.
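- A quick sketch in R (illustrative, not part of the lecture code) confirms this: if we evaluate the density on a grid, the highest value sits at the grid point closest to `\(\mu\)`:

```r
# Evaluate a Normal(mu = 2, sigma = 1) density on a fine grid and
# locate its peak: which.max() finds the grid point with highest density
mu <- 2
x <- seq(-4, 8, by = 0.01)
dens <- dnorm(x, mean = mu, sd = 1)
x[which.max(dens)]  # the grid point where the density peaks, i.e. mu
```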
--

- The second parameter `\(\sigma^2\)` (or `\(\sigma\)`) controls the dispersion of the normal distribution:

--

- For example, two normal distributions with the same variance (or value of `\(\sigma^2\)`) but different means can be drawn in R using:

.pull-left[

```r
par(mai = c(1, 0.1, 0.1, 0.1))
curve(dnorm(x, mean = 0, sd = 1),
      from = -4, to = 6,
      axes = FALSE, ann = FALSE,
      col = "red")
curve(dnorm(x, mean = 2, sd = 1),
      col = "blue", add = TRUE)
box(bty = "l")
axis(1, cex.axis = 1.3)
mtext(text = "Support", side = 1,
      line = 2, cex = 1.6)
```
]

.pull-right[
<img src="data:image/png;base64,#lec-4_files/figure-html/norm-ex-1-out-1.png" style="display: block; margin: auto;" />
]

---

## Normal distribution

- An example of two normal distributions with the same value of `\(\mu\)` but different `\(\sigma^2\)` would be:

.pull-left[

```r
par(mai = c(1, 0.1, 0.1, 0.1))
curve(dnorm(x, mean = 0, sd = 1),
      from = -9, to = 9,
      axes = FALSE, ann = FALSE,
      col = "red")
curve(dnorm(x, mean = 0, sd = 3),
      col = "blue", add = TRUE)
box(bty = "l")
axis(1, cex.axis = 1.3)
mtext(text = "Support", side = 1,
      line = 2, cex = 1.6)
```
]

.pull-right[
<img src="data:image/png;base64,#lec-4_files/figure-html/norm-ex-2-out-1.png" style="display: block; margin: auto;" />
]

--

- As we can see in the plot, as the standard deviation `\(\sigma\)` increases from 1 to 3, the normal distribution becomes wider and flatter.

--

- This indicates that `\(\sigma^2\)` and its square root `\(\sigma\)` control the variability of the distribution.

---

## Note

- Once we have assigned a value to our parameters `\(\mu\)` and `\(\sigma^2\)` (or `\(\sigma\)` in R), we have defined a Normal distribution completely.

--

- In other words, we know the density (height of the curve) assigned to each value of the random variable.

---

## Statistical models

- Now that we have a notation for our observations `\(y_{ij}\)`.

--

- And we have a statistical model in the Normal distribution.
--

- We can start formalizing our hypotheses about the outcomes of an experiment.

--

- Let's go back to the smoking example...

---

## Smoking: Null model

- Remember that our problem was that we wanted to know if people who smoke have lower lung capacity in comparison with people who do not smoke.

--

- We tested the lung capacity of 8 participants, 4 non-smokers and 4 smokers.

--

- We will denote each observation of the non-smokers group as `\(y_{11},y_{21},y_{31},y_{41}\)` and each observation of our smokers group as `\(y_{12},y_{22},y_{32},y_{42}\)`.

--

- For short, we can say that we denote with `\(i = 1,\dots,4\)` the observation number in group `\(j = 1,2\)`, where `\(1\)` represents the non-smokers.

--

- Both statements give us the same information; however, the second one is shorter.

--

- Imagine if we had 50 observations in each group; listing all of them would take us a page...

---

## Smoking: Null model

- Now we can think of two possibilities. The first one is that there are **no differences** between non-smokers and smokers. In other words, even though our observations have some variability, lung capacity is not a function of smoking status.

--

- The second one (and the one that we might be more interested in) is that lung capacity is a function of smoking status; in other words, that the groups are different.

--

- The first model is known as the **NULL Model**: a model that states that there are no differences in lung capacity between groups!

--

- We denote this model in the following way:

`$$y_{ij} \sim \text{Normal}(\mu,\sigma^2)$$`

---

## Smoking: Null model

- The **Null Model** formalizes the assumption that, regardless of the observation number `\(i\)` and the group `\(j\)` (smoking status), participants are all described by the same parameters.

--

- In other words, even though there might be some variability in our measures of lung capacity, non-smokers and smokers are all observations from the same distribution!
--

- A Normal distribution that is centered at some value `\(\mu\)` and that has some variability `\(\sigma^2\)`.

--

- Notice that we don't know the values of `\(\mu\)` and `\(\sigma^2\)` which define our statistical model; however, we can graph what the Null Model expects the data to look like.

---

## Graphical representation of the Null Model

.pull-left[

```r
par(mai = c(1, 0.1, 0.1, 0.1))
curve(dnorm(x, mean = 0, sd = 1),
      from = -4, to = 6,
      axes = FALSE, ann = FALSE,
      col = "red", lwd = 3)
curve(dnorm(x, mean = 0, sd = 1),
      col = "blue", add = TRUE,
      lty = 3, lwd = 3)
box(bty = "l")
mtext(text = "Lung capacity", side = 1,
      line = 2, cex = 1.6)
legend("topleft", bty = "n",
       col = c("blue", "red"),
       legend = c("non-smokers", "smokers"),
       lty = c(3, 1))
```
]

.pull-right[
<img src="data:image/png;base64,#lec-4_files/figure-html/normal-null-out-1.png" style="display: block; margin: auto;" />
]

--

- Notice that we have not added any numbers on the x-axis; this is because we don't know the values of the parameters. However, given the specification of the model, we know that it expects all of our observations to come from the same distribution!

---

## Statistical Inference

- Once we have defined our model, our new objective will be to find some suitable values for the parameters `\(\mu\)` and `\(\sigma^2\)` which define our statistical model.

--

- This is known as Statistical Inference, and it will be the main objective of this class.

--

- In general, we can say that Statistical Inference refers to the process by which we "infer" or learn the values of the parameters of a statistical model based on our observations of the world (data).
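--

- As a minimal sketch of this idea (illustrative only, using simulated rather than real measurements), the sample mean and sample variance are natural first guesses for `\(\mu\)` and `\(\sigma^2\)`:

```r
# Simulate 8 hypothetical lung-capacity scores from a Normal
# distribution with known parameters, then "infer" those
# parameters back from the data alone
set.seed(1)                        # make the simulation reproducible
y <- rnorm(8, mean = 75, sd = 5)   # true mu = 75, true sigma = 5
mean(y)  # sample mean: an estimate of mu
var(y)   # sample variance: an estimate of sigma^2
```

- With only 8 observations the estimates will not match the true values exactly; how good such estimates are, and how to quantify our uncertainty about them, is what the rest of the class is about.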